Applications of Cluster Analysis in Natural Resources Research


BRIAN J. TURNER 

Abstract. Cluster analysis is useful for subsetting a multivariate multi-observational set of data into homogeneous groups. The technique has wide application as shown by three examples: (a) constructing an abbreviated telephone interview from analysis of responses to a mailed questionnaire; (b) combining tree volume tables over species; and (c) forming spectral signatures from multispectral scanner data. The user is required to exercise judgment in the choice of algorithm, criterion, and number of groups. The selection of the best grouping is therefore subjective. Its greatest value is perhaps as a precursor to more objective analytical techniques. Forest Sci. 20:343-349.

Additional key words. Questionnaire analysis, tree volume, remote sensing.


The urge to classify is thoroughly human, and classification is perhaps the most commonly used method of dealing with the unknown. Our first approach in identifying an unfamiliar set of objects is to try to associate it with a known class of objects, thereby reducing the dimensionality of the unknown. If this fails because of the great variation among the unfamiliar objects, it may be instructive to subset these so that objects within a subset are more closely allied with each other than with any member of any other subset. If the subsets still cannot be associated with any known classes, at least it is helpful to know the structure associated with the set so that initial attention can be concentrated on the more important problem of identifying the group, rather than individual differences.

Cluster analysis is concerned with this subsetting problem. The basic premise of cluster analysis is that objects should be placed in the same group if measurements of variables associated with these objects are highly similar. This implies that there should be small variances within a group and large variances between groups. Furthermore, if one considered the value of each object as a point in multi-dimensional space, where each dimension represents a measured variable, then objects within a group should be close together and clearly distant from objects in other groups. Alternatively, as in the first application below, it may be desired to subset the variables measured rather than the objects. In either case, the aim is to form subsets of the data that have high internal consistency and maximum separability from other subsets.

Cluster analysis techniques are increasingly being used in taxonomic, ecological, and marketing research. The three applications given here are not drawn from these areas, but represent unusual and very different applications of the technique to classificatory problems in natural resources research.

Numerous algorithms have been suggested for optimally partitioning the data set created by taking one or more measurements on each of a set of objects. Many of these algorithms have been implemented in computer programs and the potential user of a cluster analysis technique will probably find that he has to make a choice between programs and then between options within a program. This paper may help the investigator in making these choices because each of the applications described uses a different algorithm. However, I do not suggest that the techniques applied here are the best or only appropriate methods; just that they have given useful results in these cases.

First Application: Questionnaire Analysis 

We mailed questionnaires in 1971 to almost 200 landowners in Warren County, Pennsylvania, in a study of the characteristics and management objectives of landowners who had recently acquired land in this region (Turner et al. 1973).

Only 51 questionnaires were completed and returned. This rather low success rate made it necessary to sample the non-respondents. We did this by telephone, using an abbreviated questionnaire as a basis for interviewing.

In considering the design of the telephone interview, we reasoned that if we performed a cluster analysis on the questions as answered by the mail respondents, we should find certain question-responses that tended to group together. If this were so, then we could presumably infer the answer to any question in the group from the answer to one question within the group.

Educationists involved in testing procedures use the technique of "item analysis" for the similar problem of forming groups of questions ("item pools") so that a test made up of one question from each group will have minimum redundancy in subject matter coverage. We therefore used a computer program prepared for this type of analysis. The program had the additional advantage of simplicity from the user's point of view, making it appropriate for our initial investigations of the technique. The clustering algorithm is based upon the work of Loevinger et al. (1953), who suggested that groups could be built up by first selecting that triplet of variables ("items") that had maximum covariances among themselves, then adding variables sequentially to maximize the ratio of the sum of these covariances to the sum of the respective variances. Items that would reduce this "covariance ratio" are discarded. When all items have been considered and either included in the group or discarded, a new group is begun in the same way using the items discarded from the previous group, and so on until all variables have been placed in a group, or a "residue" remains. This algorithm is implemented in a computer program called TEST07, available from the University of Alberta, Edmonton, Alberta, Canada.
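
The grouping procedure described above can be illustrated with a short sketch. The Python below is one possible reading of the covariance-ratio idea and is not a reproduction of TEST07: the function names, the use of NumPy, the choice to add at each step the item that gives the highest ratio, and the rule that a group is closed once the best remaining item would lower the ratio are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np


def covariance_ratio(cov, items):
    """Sum of pairwise covariances divided by the sum of variances."""
    idx = list(items)
    sub = cov[np.ix_(idx, idx)]
    variances = np.trace(sub)
    covariances = (sub.sum() - variances) / 2.0  # count each pair once
    return covariances / variances


def pairwise_covariance_sum(cov, items):
    """Sum of the covariances among the given items (each pair once)."""
    return sum(cov[i, j] for i, j in combinations(items, 2))


def build_item_groups(cov):
    """Greedy grouping of items; returns (groups, residue).

    Hypothetical helper: `cov` is the item covariance matrix.
    """
    cov = np.asarray(cov, dtype=float)
    remaining = list(range(cov.shape[0]))
    groups = []
    while len(remaining) >= 3:
        # Seed the group with the triplet having the largest covariances.
        seed = max(combinations(remaining, 3),
                   key=lambda t: pairwise_covariance_sum(cov, t))
        group = list(seed)
        candidates = [i for i in remaining if i not in seed]
        discarded = []
        while candidates:
            current = covariance_ratio(cov, group)
            best = max(candidates,
                       key=lambda j: covariance_ratio(cov, group + [j]))
            if covariance_ratio(cov, group + [best]) < current:
                # Every remaining item would reduce the ratio: discard them
                # and let them seed the next group.
                discarded = candidates
                break
            group.append(best)
            candidates.remove(best)
        groups.append(group)
        remaining = discarded
    return groups, remaining  # fewer than three leftover items form the residue
```

Given a covariance matrix of coded responses, `build_item_groups(cov)` returns lists of column indices; one question from each list would then be a candidate for the abbreviated interview.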

By considering the coded responses to the questionnaire questions as being the variables in this method, we were able to obtain groupings of question-responses. After selecting at least one question from each group, we were able to formulate a much-abbreviated questionnaire suitable for telephone interviews. From the responses to these questions and considering the groupings of the questions, we felt confident that we could infer the answers to the omitted questions.

Second Application: Volume Table Construction

In preparing a set of volume tables for the commercial forest species of Pennsylvania (Turner 1972a), I developed cubic-foot and board-foot volume equations for 21 species or species-groups. Where these equations are used in computer programs for volume computations, there is no real advantage in amalgamating species equations; and there would undoubtedly be some loss of precision. However, for less precise field use there could be some advantage in reducing the number of tables to a smaller set.

The models for estimating volume were of the form:

V = b_1 + b_2 D^2 H

for board-foot volume, and

V = b_1 + b_2 D^{b_3} H^{b_4}

for cubic-foot volume,

where

V = predicted volume,
D = dbhob,
H = the appropriate merchantable height, and the
b_i = estimated regression coefficients.

Figure 1. Scatter diagram of board-foot volume coefficients (b_1 and b_2) for different species, and the four groups as defined by cluster analysis (see Table 1 for interpretation of species symbols).

A suitable basis for grouping species equations would therefore be the estimation parameters, b_i. The cluster analysis program described by Rubin and Friedman (1967) was used. This program is actually a comprehensive package of clustering algorithms with options for considering continuous or discrete data and for using a number of different criteria.


This method of subsetting continuous data (also described by Friedman and Rubin 1967) is based on the partitioning of the total scatter matrix,

T = X'X, where X is the n × p matrix of n observations on p variables standardized to mean zero and variance = 1,

into g groups such that some scalar property of the pooled within-group scatter matrix (W) or of the between-groups scatter matrix (B) is optimized. The choice of the number of groups, g, and the most appropriate criterion is left to the user.

Figure 2. Logarithm of cluster analysis criterion, (|T|/|W|), plotted against number of groups formed for both board-foot volume coefficients (solid line) and first two principal components of cubic-foot volume coefficients (dashed line).
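
The partition of the total scatter matrix into within-group and between-groups parts can be made concrete with a small sketch. The Python below assumes NumPy; the function names are hypothetical, and standardizing with the sample standard deviation (`ddof=1`) is an assumption, since the text says only "variance = 1".

```python
import numpy as np


def scatter_matrices(X, labels):
    """Return (T, W, B) for observations X (n x p) and a group label per row."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Standardize each variable to mean zero and unit variance (assumed ddof=1).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    T = Z.T @ Z                                  # total scatter
    W = np.zeros_like(T)                         # pooled within-group scatter
    B = np.zeros_like(T)                         # between-groups scatter
    for g in np.unique(labels):
        members = Z[labels == g]
        centroid = members.mean(axis=0)
        dev = members - centroid
        W += dev.T @ dev
        # The grand mean of Z is zero, so the centroid is its own deviation.
        B += len(members) * np.outer(centroid, centroid)
    return T, W, B


def det_ratio(X, labels):
    """The |T| / |W| criterion for a given partition."""
    T, W, _ = scatter_matrices(X, labels)
    return np.linalg.det(T) / np.linalg.det(W)
```

For any partition T = W + B, so a partition that makes |W| small necessarily makes |T|/|W| large; a compact, well-separated grouping scores higher than an arbitrary one.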

This poses one of the dilemmas of cluster analysis techniques. Occasionally the user may know how many groups into which he wants to subset the data; more often he may have a desirable range of g-values. The clustering methods will not give the best value of g, although sometimes this may be indicated by a sudden increase in the optimal criterion value as g is varied. Furthermore, different criteria for a given g-value give differing group compositions. The user may wish to try different criteria and look for consistent groupings, and temper this with his intuitive feel for the structure of the data before making final groupings. Alternatively, he may rely on the findings of Scott and Symons (1971) and Marriott (1971) that for many types of continuous data the best criterion is to minimize the determinant of W (|W|).

In the case of the volume table analysis, I decided that fewer than four species-group tables would not cover the variation adequately, whereas more than six would negate the value of a reduced set of tables. Several different criteria were used; only some are reported here.

The board-foot volume coefficients were analyzed first because the fact that only two variables are involved permitted a ready comparison between the solutions and a scatter plot of b_1 against b_2 (Fig. 1). The criterion used was: maximize (|T|/|W|), which is equivalent to minimizing |W| because |T| is constant for a data set. When the logarithms of the maximum criterion values for different g-values were plotted against g, it was evident that the increases in the criterion value from g = 3 to 4 and g = 4 to 5 were similar and substantially greater than the increase from g = 5 to 6 (Fig. 2). This gave an objective basis for rejecting g = 6. Assuming that, all other things being equal, the fewer the volume tables the better, g = 4 was preferred over g = 5. The optimum composition of the groups for g = 4 is shown in Figure 1.
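
One way to obtain a criterion value for each g is a simple relocation search that moves single observations between groups while |W| keeps decreasing. The sketch below is an assumption-laden illustration, not the Rubin and Friedman program: the function names, the arbitrary starting partition, and the stopping rule are hypothetical, and a single start of this kind may stop at a local optimum.

```python
import numpy as np


def det_within(Z, labels, g):
    """Determinant of the pooled within-group scatter matrix W."""
    W = np.zeros((Z.shape[1], Z.shape[1]))
    for k in range(g):
        members = Z[labels == k]
        dev = members - members.mean(axis=0)
        W += dev.T @ dev
    return np.linalg.det(W)


def hill_climb(Z, g, max_passes=50):
    """Return (labels, log(|T|/|W|)) after a relocation search for g groups."""
    Z = np.asarray(Z, dtype=float)
    Z = Z - Z.mean(axis=0)                      # center so that T = Z'Z
    labels = np.arange(len(Z)) % g              # arbitrary starting partition
    best = det_within(Z, labels, g)
    for _ in range(max_passes):
        improved = False
        for i in range(len(Z)):
            home = labels[i]
            if np.sum(labels == home) == 1:
                continue                        # never empty a group
            for k in range(g):
                if k == home:
                    continue
                labels[i] = k                   # trial move of observation i
                trial = det_within(Z, labels, g)
                if trial < best:
                    best, home, improved = trial, k, True
            labels[i] = home                    # keep the best group found
        if not improved:
            break
    log_criterion = np.log(np.linalg.det(Z.T @ Z) / best)
    return labels, log_criterion
```

Calling `hill_climb(Z, g)` for g = 3, 4, 5, 6 and comparing successive log-criterion values mirrors the reading of Figure 2: similar increases up to some g followed by a much smaller one suggest stopping there.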

Because there were significant intercorrelations among the four cubic-foot volume coefficients, a reduction in computer processing time seemed possible by first taking principal components (an option available with this program) and then clustering a reduced set of components. The first two components explained 89 percent of the variation; hence cluster analysis was performed on these. By using the (|T|/|W|) criterion again, maximum values were obtained for varying g-values and their logarithms plotted against g (Fig. 2). This indicated that g = 4 or 6 were suitable values, so again, four groups were chosen. The composition of the groups is given in Table 1.
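
The step of taking principal components before clustering can be sketched as follows. This is a minimal illustration assuming NumPy and a singular value decomposition of the standardized coefficients; the function name and the standardization choice are assumptions, and the program used in the study may have computed its components differently.

```python
import numpy as np


def principal_components(X, n_keep):
    """Return (scores on the first n_keep components, explained proportions)."""
    X = np.asarray(X, dtype=float)
    # Standardize so that each coefficient contributes on the same scale.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the component directions; squared singular values are
    # proportional to the variance carried by each component.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    scores = Z @ Vt[:n_keep].T
    return scores, explained
```

With a 21 × 4 matrix of cubic-foot coefficients, `principal_components(X, 2)` would return a 21 × 2 score matrix for clustering, and `explained[:2].sum()` is the proportion of variation retained.
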

Table 1. Optimal species grouping for cubic-foot volumes based on cluster analysis of equation coefficients.

Group I                 Group II            Group III            Group IV
White pine (WP)         Red pine (RP)       Yellow birch (YB)    Pitch pine (PP)
Hemlock (HE)            Sugar maple (SM)    Sweet birch (SB)
Misc. softwoods (MS)    Red maple (RM)      Beech (BE)
Black oak (BO)          Red oak (RO)        Aspen (AS)
Chestnut oak (CO)       Scarlet oak (SO)    Black cherry (BC)
Yellow poplar (YP)      White oak (WO)
Misc. hardwoods (MH)    White ash (WA)
                        Basswood (BA)


As a check on the groupings, cubic-foot volumes were read from the tables for three representative dbh's and heights. Mean volumes and standard deviations were computed for each of the three tree sizes of each group (Table 2). Although the standard deviations are somewhat overlapping, there does appear to be on this basis more difference between groups than within them.

The normal statistical procedure for testing whether simple linear regression models might be combined is to test for slope and intercept differences in an analysis of variance (for example, Williams 1959), and combine models on the basis of multiple range tests on the regression coefficients. In the case of the board-foot volume models, the differences between slope coefficients (b_2) were so large in comparison with their standard deviations that such a procedure would have reduced the number of equations only by about two. If a similar statistical procedure were available for testing the non-linear cubic-foot volume models, a similar situation would likely be found. Cluster analysis provides a means, therefore, for grouping species for volume estimation purposes where it is desirable to have only a few different equations or tables. It should be recognized, however, that there will be a loss in precision over using individual species equations, because the groups do not represent a statistically homogeneous population.

Table 2. Cubic volumes for three tree sizes, with 21 species equations clustered into four groups (converted to metric units).

                             Mean ± Standard Deviation (m³)
Dbh    Merchantable height   Group I              Group II         Group III            Group IV
(cm)   (in 2.4 m bolts)      (7 species)          (8 species)      (5 species)          (1 species)
12.7   [illegible]           0.05[?] ± 0.006      0.056 ± 0.006    0.062 ± 0.006        0.022
35.6   7                     [illegible] ± 0.022  0.862 ± 0.020    0.907 ± [illegible]  0.823
61.0   10                    2.965 ± 0.076        3.254 ± 0.120    [illegible] ± 0.238  3.349

Third Application: Analysis of Multispectral Scanner (MSS) Remote-Sensed Data

The increasing availability of MSS remote-sensed data from airborne and space platforms is making automated land-use and vegetative mapping a reality. Mapping methods require that representative spectral signatures be obtained for each target of interest (land-use class, species type, and so on) and then each MSS response element be classified according to which target signature it most closely matches.

A response element on a digital tape derived from MSS consists of the spectral response (or spectral signature) from a point on the ground. The response is a vector of size equal to the number of bands into which the visible and near-visible spectrum is split by the scanner. The value of the vector is the intensity of radiation received by the scanner in each spectral band from the geographical point. The size of the point varies with the altitude of the scanner and other factors; for ERTS-1 satellite data it is about 0.5 ha. The data can be arranged on a digital tape in geographic relationship so that a scene of interest can be selected by referencing the boundary scan lines (perpendicular to the flight line) and elements within scan lines.

Figure 3. Portion of plotter line map of Stone Valley Experimental Forest and surroundings, Huntingdon County, Pennsylvania, derived from computer processing of ERTS-1 satellite MSS data. The area shown represents about [illegible] ha. Coordinates refer to scan line and element numbers. Types depicted are: hardwoods (vertical lines), shaded hardwoods (right-to-left slashes), hemlock-hardwood (horizontal lines), conifers (left-to-right slashes), water (+'s), fields (open areas), unclassified (X's).

One method for obtaining representative spectral signatures is cluster analysis. As a component of a system of computer programs for analyzing MSS digital tapes (Borden 1972), I wrote a program using a cluster analysis algorithm (Turner 1972b) which is now being used in the analysis of data from ERTS-1 as well as aircraft data.


The clustering algorithm was influenced by the "iterative condensation on centroids" procedure (Tryon and Bailey 1970), a useful method when the number of observations is very large. The method has, however, been freely adapted for computational efficiency, since we need only the mean and variance of each group or target-type and not the elemental composition of each group.

The algorithm uses the first scan line of the scene of interest to form trial centroids or mean spectral signatures. The number of these formed is controlled by the user, who specifies a critical separation value for the formation of a new centroid. The whole scene is then sampled and the mean spectral signatures are revised, and added to if necessary, on the basis of the sampled elements and variances computed from the data. Thus while the intensity of classification is under user control, variability in the sample data is used to assist in the assigning of elements to groups. For instance, a "mixed hardwoods" group will have more inherent variability than a "pure conifer" group. The output from the program is thus a set of mean spectral signatures with a measure of the variability within each group. The scene can then be reprocessed to produce a line printer character map with characters assigned according to the computed mean spectral responses and critical values derived from the variability measures. Alternatively, the output data can be combined with spectral data derived by other means, and character or plotter-drawn line maps can be produced. An example of the latter is shown in Figure 3.
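
The idea of forming mean spectral signatures while keeping only a mean and variance per group can be sketched in a few lines. The Python below is a loose illustration and not the program of Turner (1972b): the class and function names, the use of Euclidean distance for the critical separation, the single pass over the elements, and the running (Welford-style) update of mean and variance are all assumptions.

```python
import numpy as np


class Centroid:
    """Running mean and variance of the elements assigned to one group."""

    def __init__(self, x):
        self.n = 1
        self.mean = np.array(x, dtype=float)
        self.m2 = np.zeros_like(self.mean)     # sum of squared deviations

    def update(self, x):
        # Update the mean and squared deviations without storing the members.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        if self.n < 2:
            return np.zeros_like(self.mean)
        return self.m2 / (self.n - 1)


def form_signatures(elements, critical_distance):
    """Assign each spectral vector to the nearest centroid, or start a new one."""
    centroids = []
    for x in np.asarray(elements, dtype=float):
        if centroids:
            dists = [np.linalg.norm(x - c.mean) for c in centroids]
            nearest = int(np.argmin(dists))
            if dists[nearest] <= critical_distance:
                centroids[nearest].update(x)
                continue
        # Farther than the critical separation from every centroid so far.
        centroids.append(Centroid(x))
    return centroids
```

A smaller `critical_distance` yields more signatures and hence a finer classification; the per-band variances returned for each centroid could then supply group-specific critical values when the scene is reprocessed.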

Conclusions

The three foregoing applications illustrate the potential of cluster analysis techniques in dealing with a wide range of classification problems. They also illustrate some of the difficulties associated with using cluster analysis:

(a) There are a large number of different methods associated with the
general term "cluster analysis" and the user must exercise care in
choosing an appropriate one.

(b) The user is generally required to 
make some judgmental decisions in 
using the technique, for instance, in 
the choice of number of groups 
desired. 

(c) The user will generally have to experimentally vary parameters and
finally have to make a subjective judgment as to which result is
"best."

Despite this, cluster analysis provides a relatively easy and cheap method of taking a first look at multivariate data. Because of the wide variety of clustering algorithms and their general robustness, continuous, discrete, or mixed data can be analyzed with equal facility. The detection of groupings within the data set may provide suggestions for hypotheses which can then be tested by more refined and objective analytical techniques.